set.seed(12345)
source("ingest_data.R")
MODEL_CACHE <- "added-M_intercepts_only"
Models are stored in the following directory, which must exist prior to knitting this document:
cat(normalizePath(paste(getwd(), dirname(cachefile(MODEL_CACHE)), sep="/"), mustWork = T))
## /home/app/.cache
The cache directory can be controlled via the cache parameter of the Rmd document - it can be useful to experiment with this parameter if you knit the document manually in RStudio.
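As a minimal sketch of how such a parameter can be declared (the key name, default value, and filename below are assumptions, not this document's actual front matter), an Rmd header exposes it under `params`, and code chunks read it as `params$cache`:

```yaml
---
params:
  cache: "~/.cache"   # hypothetical default; override when rendering
---
```

For example, `rmarkdown::render("document.Rmd", params = list(cache = "/tmp/cache"))` overrides it for a single run.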
The simplest possible model, only intercepts (on population, committerteam and team-in-repo level):
This model assumes that each team-repository combination has a unique intercept, based on the population-level intercept, plus a team-level and team-per-repo-level offset from that intercept. This model does not take into account any statistics about the actual change - it only assumes “when team T changes something in repo R, then the likelihood of introducing duplicates is y”.
We use identical formulas and priors for the zero-inflation part of the equation (as the number of zeros dominates in the data).
This is our formula, including the brms default priors, which we will change below.
d <- data |> select(y = INTROD,
                    team = committerteam,
                    repo = repo)
formula <- bf(y ~ 1 + (1 | team) + (1 | team:repo),
              zi ~ 1 + (1 | team) + (1 | team:repo))
get_prior(data = d,
          family = zero_inflated_negbinomial,
          formula = formula)
## prior class coef group resp dpar nlpar lb ub
## student_t(3, -2.3, 2.5) Intercept
## student_t(3, 0, 2.5) sd 0
## student_t(3, 0, 2.5) sd team 0
## student_t(3, 0, 2.5) sd Intercept team 0
## student_t(3, 0, 2.5) sd team:repo 0
## student_t(3, 0, 2.5) sd Intercept team:repo 0
## gamma(0.01, 0.01) shape 0
## logistic(0, 1) Intercept zi
## student_t(3, 0, 2.5) sd zi 0
## student_t(3, 0, 2.5) sd team zi 0
## student_t(3, 0, 2.5) sd Intercept team zi 0
## student_t(3, 0, 2.5) sd team:repo zi 0
## student_t(3, 0, 2.5) sd Intercept team:repo zi 0
## source
## default
## default
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## default
## default
## default
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
We adjust the priors accordingly:
priors <- c(prior(normal(0, 0.5), class = Intercept),
            prior(weibull(2, 0.25), class = sd),
            prior(normal(0, 0.5), class = Intercept, dpar = zi),
            prior(weibull(2, 0.25), class = sd, dpar = zi),
            prior(gamma(0.5, 0.1), class = shape))
validate_prior(prior = priors,
               formula = formula,
               data = d,
               family = zero_inflated_negbinomial)
## prior class coef group resp dpar nlpar lb ub
## normal(0, 0.5) Intercept
## normal(0, 0.5) Intercept zi
## weibull(2, 0.25) sd 0
## weibull(2, 0.25) sd zi 0
## weibull(2, 0.25) sd team 0
## weibull(2, 0.25) sd Intercept team 0
## weibull(2, 0.25) sd team zi 0
## weibull(2, 0.25) sd Intercept team zi 0
## weibull(2, 0.25) sd team:repo 0
## weibull(2, 0.25) sd Intercept team:repo 0
## weibull(2, 0.25) sd team:repo zi 0
## weibull(2, 0.25) sd Intercept team:repo zi 0
## gamma(0.5, 0.1) shape 0
## source
## user
## user
## user
## user
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## user
Using the brms default shape prior (gamma(0.01, 0.01)) shows that we get unreasonably many zeros, and the maximum observed value is also incredibly large (3e7 or more). Because of this, we also have very few observations in the plausible range (1-1000 duplicates). We also get some (1%) divergent transitions. Thus, we need to tighten that prior to a more realistic shape. After some experimenting, we arrive at gamma(0.5, 0.1) as a good compromise.
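To see why the default misbehaves, a small base-R simulation (independent of brms, with the mean fixed at 1 purely for illustration) contrasts negative-binomial draws under the two shape priors:

```r
# Sketch (not the full brms model): shape values drawn from each prior,
# pushed through rnbinom() with a fixed mean mu = 1.
set.seed(1)
n <- 1e5
default_shape <- pmax(rgamma(n, 0.01, 0.01), 1e-12)  # guard against underflow to 0
tight_shape   <- rgamma(n, 0.5, 0.1)
y_default <- rnbinom(n, size = default_shape, mu = 1)
y_tight   <- rnbinom(n, size = tight_shape,   mu = 1)
# The default prior yields almost exclusively zeros plus extreme outliers;
# the tighter prior keeps most draws in a plausible range.
c(default_zeros = mean(y_default == 0), tight_zeros = mean(y_tight == 0))
c(default_max = max(y_default), tight_max = max(y_tight))
```

Because gamma(0.01, 0.01) places most of its mass near zero, the negative-binomial dispersion explodes: nearly every draw is 0, and the rare nonzero draws are enormous.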
M_ppc <- brm(data = d,
             family = zero_inflated_negbinomial,
             formula = formula,
             prior = priors,
             file = cachefile(paste0("ppc-", MODEL_CACHE)),
             warmup = 1000,
             iter = ITERATIONS,
             chains = CHAINS,
             cores = CORES,
             sample_prior = "only",
             backend = "cmdstanr",
             threads = threading(THREADS),
             save_pars = SAVE_PARS,
             adapt_delta = ADAPT_DELTA)
m <- M_ppc
yrep <- posterior_predict(m)
We expect the number of zeros (changes that do not introduce duplicates) to dominate, as clone introduction typically would be a rare event.
ppc_stat(y = d$y, yrep, stat = function(y) mean(y == 0)) + ggtitle("Prior predicted proportion of zeros")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Our prior for the zero-inflation is not particularly strong - but at least we assume that the overall ratio of zeros should be over half the changes - somewhat lower than the \(\approx 95\)% present in the data.
While our priors place the maximum likely value towards the lower end of the thousands, they do not rule out even 150,000 duplicates in a single change (which we think is quite unrealistic). But it is important to allow some very unlikely values in the priors, because parameter space where prior information is zero will never show up in the posterior, even if it is present in the data (following Bayes’ theorem). Thus, even though the priors show some unreasonable values, at least we know that if we ever see such data, our model will allow us to see it.
(sim_max <- ppc_stat(y = d$y, yrep, stat = "max") + ggtitle("Prior predicted max values"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Scaling to more reasonable values shows that our model actually expects somewhat smaller max values than observed, on average - but we saw above that it does encompass the observed max value as well, and quite a bit more range.
sim_max + scale_x_continuous(limits = c(0,1000)) + ggtitle("Prior predicted max values up to 1000")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The 99th percentile is a more stable metric than the maximum value for highly right-skewed data like ours. The priors do a good job of matching the observations.
ppc_stat(y = d$y, yrep, stat = "q99") + ggtitle("Prior predicted Q99 vs. observed value")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
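Note that bayesplot's `ppc_stat()` resolves a string-valued `stat` as a function name, so `"q99"` (and `"q95"` below) must exist in scope. These helpers are presumably defined in ingest_data.R; a minimal sketch of what we assume they look like:

```r
# Assumed quantile helpers (presumably provided by ingest_data.R);
# ppc_stat(stat = "q99") looks up a function with this name.
q95 <- function(y) quantile(y, probs = 0.95, names = FALSE)
q99 <- function(y) quantile(y, probs = 0.99, names = FALSE)
q99(0:100)  # → 99
```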
We can even plot the 95th percentile versus the 99th, just to show how the spread of likely values varies.
(p <- ppc_stat_2d(d$y, yrep, stat = c("q95", "q99")) + theme(legend.position = "bottom") + ggtitle("Prior predicted Q95 vs Q99"))
The standard deviation of the predictions is a bit harder to grasp intuitively. But our prior information encompasses the observed value well.
(p <- ppc_stat(y = d$y, yrep, stat = "sd") + ggtitle("Prior predicted stddev vs. observed value"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Zooming in to show the distribution in more detail.
p + scale_x_continuous(limits=c(0,30))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can use groups present in the data to show how predictions stack up against observations made per group. Here, besides showing the distribution, it is also important to consider the scale on the respective \(x\)-axis. Teams with few, and varied, observations (such as UI and Unknown) are more reliant on the priors, and will have a larger range of predictions.
ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$team) + theme(legend.position = "bottom") + ggtitle("Prior predictive Q99 observation per team")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Plotting prior predictions per repo can show how our model compares relative to repo-level metrics. We see that Integration tests are a bit of an anomaly. Our model does not incorporate any repo-level group (only the team:repo-level grouping, that is, how each team behaves in a particular repo), so all predictions will follow the same distribution. Note how all \(x\)-axes are quite similar in scale (the differences come from random variation, and the differing number of data points arising from each repository).
ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$repo) + theme(legend.position = "bottom") + ggtitle("Prior predictive Q99 observation per repository")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Model execution
M_intercepts_only <-
  brm(data = d,
      family = zero_inflated_negbinomial,
      file = cachefile(MODEL_CACHE),
      formula = formula,
      prior = priors,
      warmup = 1000,
      iter = ITERATIONS,
      chains = CHAINS,
      cores = CORES,
      backend = "cmdstanr",
      file_refit = "on_change",
      threads = threading(THREADS),
      save_pars = SAVE_PARS,
      adapt_delta = ADAPT_DELTA)
M_intercepts_only <- add_criterion(M_intercepts_only, criterion = "loo")
m <- M_intercepts_only
p <- mcmc_trace(m)
pars <- levels(p[["data"]][["parameter"]])
plots <- seq(1, to=length(pars), by=12)
lapply(plots, function(i) {
  start <- i
  end <- start + 11
  mcmc_trace(m, pars = na.omit(pars[start:end]))
})
## (trace plots for all parameters, in 17 batches of up to 12 panels each)
The “caterpillar plots” show that the chains mixed well, and explored the available parameter space.
mcmc_plot(m, type="rhat")
mcmc_plot(m, type="rhat_hist")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
mcmc_plot(m, type="neff")
mcmc_plot(m, type="neff_hist")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The MCMC chains, \(\hat{R}\), and \(N_{eff}\) ratios all look good.
loo <- loo(m)
loo
##
## Computed from 12000 by 31007 log-likelihood matrix.
##
## Estimate SE
## elpd_loo -10708.1 198.3
## p_loo 114.9 11.4
## looic 21416.2 396.5
## ------
## MCSE of elpd_loo is NA.
## MCSE and ESS estimates assume MCMC draws (r_eff in [0.7, 1.9]).
##
## Pareto k diagnostic values:
## Count Pct. Min. ESS
## (-Inf, 0.7] (good) 31002 100.0% 122
## (0.7, 1] (bad) 3 0.0% <NA>
## (1, Inf) (very bad) 2 0.0% <NA>
## See help('pareto-k-diagnostic') for details.
plot(loo)
We have 5 observations with high Pareto k values, most likely due to sparse data for some observations. These might be highly influential points, and should be investigated with the reloo function, which excludes each suspicious data point in turn, refits the model, and compares the result with the original model.
If these recalculations produce reasonable Pareto k values (< 0.7), then we can still trust the results. Reloo is disabled by default, but can be enabled by setting the RELOO environment variable (which in turn sets the reloo parameter of this document).
reloofile <- cachefile(paste0("reloo-", MODEL_CACHE, ".rds"))
if (file.exists(reloofile)) {
  (reloo <- readRDS(reloofile))
} else {
  Sys.time()
  (reloo <- reloo(m, loo, chains = CHAINS, cores = CORES))
  Sys.time()
  saveRDS(reloo, reloofile)
}
##
## Computed from 12000 by 31007 log-likelihood matrix.
##
## Estimate SE
## elpd_loo -10710.3 198.5
## p_loo 117.1 12.9
## looic 21420.6 397.1
## ------
## MCSE of elpd_loo is 0.5.
## MCSE and ESS estimates assume MCMC draws (r_eff in [0.7, 1.9]).
##
## All Pareto k estimates are good (k < 0.7).
## See help('pareto-k-diagnostic') for details.
Note that the reloo results are cached, and included in the docker image by default. If you want to reproduce the reloo computation manually, set a separate cache dir (and mount it into the docker image), and set the RELOO environment variable. It will take several hours, so it is best done overnight.
# plotting the recalculated values
plot(reloo)
# which points have higher pareto_k than 0.5?
influencers <- data[loo::pareto_k_ids(reloo),]
influencers |> group_by(repo,committerteam) |> tally()
## # A tibble: 0 × 3
## # Groups: repo [0]
## # ℹ 3 variables: repo <fct>, committerteam <fct>, n <int>
The reloo computation reveals that all Pareto k estimates are good, so we should be able to trust the model inferences.
## Posterior predictive checks
yrep <- posterior_predict(m)
ppc_stat(y = d$y, yrep, stat = function(y) mean(y == 0)) + ggtitle("Posterior predicted proportion of zeros")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of zeros is spot-on.
sim_max <- ppc_stat(y = d$y, yrep, stat = "max") + ggtitle("Posterior predicted max value")
sim_max
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The most likely posterior max value is slightly below 100, and the observed value (from the IntTest repository) is around 150. But our priors are not too strong; our model places some probability on values up to a few hundred clones. This means that we can trust our model to consider such data as well, should it ever show up.
ppc_stat(y = d$y, yrep, stat = "sd") + ggtitle("Posterior predicted standard deviation")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Most of the probability mass is between 1.1 and 2 in standard deviation, slightly below the observed value.
ppc_stat_2d(d$y, yrep, stat = c("q95", "q99")) + ggtitle("Posterior predicted Q95 vs Q99")
The model converges very well around the observed values of 1 for the 95th percentile and 5 for the 99th percentile, while still allowing values between 4 and 7 for the 99th percentile.
ppc_stat_grouped(y = d$y, yrep, stat = "max", group = d$team) + ggtitle("Posterior predictive max observation per team")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The max value per team varies between teams, but for most teams the observed value falls reasonably well within the predictions. Red team is the exception. Remember, this model does not take any numerical predictor (such as the size of the change) into account, so differing team sensitivity to such predictors will not be visible.
ppc_stat_grouped(y = d$y, yrep, stat = "max", group = d$repo) + ggtitle("Posterior predictive max observation per repo")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The posterior max per repository shows some variation, in particular for Saturn and Jupiter.
ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$team) + ggtitle("Posterior predictive 99% quantile per team")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The 99th quantile is a more stable metric than the max value. The UI team shows that it relies more on the priors (it has a much wider \(x\)-axis range in its predictions).
ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$repo) + ggtitle("Posterior predictive 99% quantile per repo")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Similar conclusions for the 99th quantile per repository.
source("predict_utils.R")
# the predict functions use predictors, so we have to supply them. But this model does not consider these values, only the team, the repository, and the intercept
heatmap_by_team_and_repo(posterior_predict_by_team_and_repo(m, added=q95(data$ADD), removed=q95(data$DEL), complexity=q95(data$COMPLEX), duplicates = q95(data$DUP), summary=function(x) q99(x)), "Quantile 99%", decimals=0) + ggtitle("Quantile 99% per team and repo", "")
Apart from the Integration Test repository, and to some extent Neptune, this model does not really differentiate between teams in the various repositories.
Our model converges well, but is limited in its predictive capabilities. It will be used as a baseline model.